Abstract

This paper describes a novel method to approximate the polynomial coefficients of regression
functions, with particular interest in multi-dimensional classification. The derivation
is simple, and offers a fast, robust classification technique that is resistant to over-fitting.
Keywords: General Regression, Multivariate Tools, Classification, Taylor Expansion,
Characteristic Functions
1. Procedure to calculate the classification polynomial

The goal of binary classification is to find the distinction between a signal s(x) and a
background b(x) probability distribution. The optimal separation contours are described by
Neyman and Pearson (1933), and it is well known that these contours can be found by binomial
regression (see Bishop, 2000). In neural networks it is typically a regression between target
values ±1, and the optimal response function F(x) is related to the purity of the signal, P(s|x):

    F(x) = \frac{s(x) - b(x)}{s(x) + b(x)} = 2 P(s|x) - 1 .
By reordering, performing a Fourier transformation and using the Taylor expansion of F(x),
it becomes

    \sum_{k=0}^{\infty} \frac{F_k}{k!} \int x^k \left( s(x) + b(x) \right) e^{i\omega x} \, dx = \int \left( s(x) - b(x) \right) e^{i\omega x} \, dx .    (1)
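The Taylor series of characteristic functions can be expressed with the ⟨x^k⟩ moments of the
corresponding distribution, which is used in the following definition of the g(x) and h(x)
functions and their Fourier transforms ĝ(ω) and ĥ(ω):

    g(x) := s(x) + b(x) , \qquad \hat{g}(\omega) := \int g(x) \, e^{i\omega x} \, dx = \sum_{k=0}^{\infty} \frac{(i\omega)^k}{k!} \, g_k , \qquad g_k := \int x^k g(x) \, dx ,
    h(x) := s(x) - b(x) , \qquad \hat{h}(\omega) := \int h(x) \, e^{i\omega x} \, dx = \sum_{k=0}^{\infty} \frac{(i\omega)^k}{k!} \, h_k , \qquad h_k := \int x^k h(x) \, dx .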
Substituting g(x) and h(x) back into eq. (1), and exploiting that the Fourier transform of
x^k g(x) can be expressed with the kth derivative of ĝ(ω), one gets an equation that is true
for every ω, hence the equation should hold for the coefficients of ω^k for every k as follows:

    h_k = \sum_{j=0}^{\infty} \frac{1}{j!} \, g_{k+j} \, F_j ,    (2)

where g_k and h_k are the kth moments of g(x) and h(x).
These equations can be solved for F_j either by a deconvolution, or by finding the solution
for the matrix equation

    \begin{pmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \end{pmatrix} =
    \begin{pmatrix} g_0 & g_1 & g_2 & \cdots \\ g_1 & g_2 & g_3 & \cdots \\ g_2 & g_3 & g_4 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}
    \begin{pmatrix} F_0 \\ F_1 \\ F_2 \\ \vdots \end{pmatrix} ,    (3)

where the 1/k! coefficients were absorbed into the F_k unknowns, which also simplifies the
later evaluation of the F(x) function.
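The step from the Fourier-transformed equation to the coefficient equations can be spelled out as follows. This is only a sketch under assumed conventions that the text does not state explicitly: the transform is taken as \hat{g}(\omega) = \int g(x) e^{i\omega x} dx, g_k and h_k denote the kth moments of g(x) and h(x), and F(x) = \sum_j F_j x^j / j!.

```latex
% Assumed conventions: \hat g(\omega) = \int g(x) e^{i\omega x} dx,  g_k = \int x^k g(x) dx
\begin{align*}
\hat g(\omega) &= \sum_{n=0}^{\infty} \frac{(i\omega)^n}{n!}\, g_n , \\
\int x^j g(x)\, e^{i\omega x}\, dx
   &= (-i)^j \frac{d^j \hat g}{d\omega^j}
    = \sum_{k=0}^{\infty} \frac{(i\omega)^k}{k!}\, g_{k+j} , \\
\sum_{j=0}^{\infty} \frac{F_j}{j!} \sum_{k=0}^{\infty} \frac{(i\omega)^k}{k!}\, g_{k+j}
   &= \sum_{k=0}^{\infty} \frac{(i\omega)^k}{k!}\, h_k
   \quad\Longrightarrow\quad
   \sum_{j=0}^{\infty} \frac{F_j}{j!}\, g_{k+j} = h_k \quad \text{for every } k .
\end{align*}
```

Renaming F_j / j! to F_j gives a linear system whose matrix has the elements g_{k+j}, that is, a Hankel matrix built from the moments of g(x).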

A possible approximation is to use the upper left k × k part of the matrix, and solve the
finite system of equations. An example can be seen on fig. 1, where a 20 degree polynomial
was used as a classifier on a Gaussian mixture sample with 10^4 events, while the testing
was done on an independent sample of 10^4 events. The resulting separation power is very similar
to the theoretical optimum. Figure 1c clearly shows that the purity P(s|x), evaluated in
bins of F(x), has a monotonic dependence on F(x) itself. Lower order approximations of
F(x) may produce non-linear, but still monotonically ascending curves, which is
a requirement for a good classifier, as one can safely say that the events to the right of a certain
F(x) value are more signal-like than the events to the left.
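The truncated system can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the figures: the function names are hypothetical, equal signal and background normalisation is assumed, the 1/k! factors are taken as already absorbed into the coefficients, and a least-squares solver is swapped in because moment matrices tend to be ill-conditioned at high degree.

```python
import numpy as np

def moment_system(g_mom, h_mom, degree):
    """Truncated system h_k = sum_j g_{k+j} F_j for k, j = 0..degree.

    g_mom must hold the moments g_0 .. g_{2*degree}, h_mom at least h_0 .. h_degree.
    """
    n = degree + 1
    G = np.array([[g_mom[k + j] for j in range(n)] for k in range(n)], dtype=float)
    h = np.asarray(h_mom[:n], dtype=float)
    return G, h

def fit_coefficients(signal, background, degree):
    """Estimate polynomial coefficients from 1D signal and background samples."""
    orders = range(2 * degree + 1)
    mom_s = np.array([np.mean(signal ** k) for k in orders])
    mom_b = np.array([np.mean(background ** k) for k in orders])
    # g = s + b and h = s - b, both samples weighted equally (assumption)
    G, h = moment_system(mom_s + mom_b, mom_s - mom_b, degree)
    coeffs, *_ = np.linalg.lstsq(G, h, rcond=None)
    return coeffs

def evaluate(coeffs, x):
    """F(x) = sum_k F_k x^k, with the 1/k! factors absorbed into F_k."""
    return np.polynomial.polynomial.polyval(x, coeffs)
```

For inputs that are not of order one, rescaling x before taking the moments keeps the matrix elements in a comparable range.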

2. Optimisations for multi-dimensional input 

In case the dimension of the input is greater than one, the nth moments of the distributions
become nth order tensors, and similarly g^k, h^k and F^k. The equation that connects these
three tensor series is similar to eq. (2), except that it has free tensor indices:



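    h^{k}_{\mu_1 \cdots \mu_k} = \sum_{j=0}^{\infty} g^{k+j}_{\mu_1 \cdots \mu_k \nu_1 \cdots \nu_j} \, F^{j}_{\nu_1 \cdots \nu_j}    (4)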
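The moment tensors entering this equation can be estimated directly from a sample. The sketch below is only an illustration with a hypothetical function name; it builds the full d^n tensor, which is wasteful for anything but small d and n.

```python
import numpy as np

def moment_tensor(samples, order):
    """Sample estimate of <x_{mu_1} ... x_{mu_n}> for samples of shape (N, d)."""
    n_events, d = samples.shape
    tensor = np.ones(n_events)
    for _ in range(order):
        # append one tensor index per step by broadcasting against the samples
        shape = (n_events,) + (1,) * (tensor.ndim - 1) + (d,)
        tensor = tensor[..., None] * samples.reshape(shape)
    return tensor.mean(axis=0)
```

The result is symmetric under any permutation of its indices, which is the property exploited later to store only the free parameters.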



Polynomial expansion of the binary classification function 



[Figure 1, panels: (a) training sample; (b) classification with polynomial; (c) monotonic prediction of the target; (d) the theoretically optimal separation. Axis label: F(x).]

Figure 1: [1a] Example distribution with a Gaussian signal (solid blue) and a background of two
Gaussian peaks (meshed red). [1b] Separation of signal from background with a 20 degree polynomial
F(x), comparable with the optimal separation on [1d]. [1c] The purity of the signal, evaluated for
binned F(x) values, shows the linear dependency on F(x), close to the ideal value (dashed red line).
Although this is a tensor equation, the indices of h^k_{μ1…μk} and F^j_{ν1…νj} can be serialised, while
g^{k+j}_{μ1…μk ν1…νj} can be turned into a rectangular matrix with the serialised indices of ν1…νj as
columns, and μ1…μk as rows. This system of equations can be rewritten as a block matrix
equation, similar to eq. (3). However, a d-dimensional, nth order symmetric tensor has
only \binom{n+d-1}{n} free parameters from the possible d^n. The difference can be many orders of
magnitude even for small d and n, therefore to speed up computations and use less memory,
it is beneficial to compactify the tensors in question, in a way described by Ballard et al.
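The size of the saving is easy to check numerically. The helper names below are hypothetical and the snippet only evaluates the two counts quoted above.

```python
from math import comb

def n_free(d, n):
    """Free parameters of a d-dimensional symmetric tensor of order n."""
    return comb(n + d - 1, n)

def n_full(d, n):
    """Number of elements of the same tensor without using the symmetry."""
    return d ** n

# For d = 3 and n = 20 the symmetric storage needs 231 numbers instead of 3**20.
print(n_free(3, 20), n_full(3, 20))
# Summing the free parameters over all orders 0..20 gives the number of unknowns
# of a 20 degree polynomial in three variables.
print(sum(n_free(3, n) for n in range(21)))
```

For d = 3 the sum over the orders 0 to 20 is 1771, which is the size of the linear system obtained for a 20 degree polynomial in three variables.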



For symmetric tensors of degree n, the component belonging to an index vector
μ = {μ1, …, μn} is the same for any permutation of the μi. This set of indices can be uniquely
identified with the monomial of the index, m(μ) = {m1, …, md}, which is a d-dimensional
vector, where mi is the multiplicity of the value i in the index vector μ. The multiplicity of
a given monomial is the multinomial coefficient:

    \binom{n}{m_1, \dots, m_d} = \frac{n!}{m_1! \cdots m_d!} .
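As a small illustration of the monomial bookkeeping, the following sketch maps an index vector to its monomial and counts the permutations that share it. The function names are hypothetical, and the index values are taken to start from 1.

```python
from math import factorial

def monomial(mu, d):
    """Multiplicity vector m(mu): entry i-1 counts how often the value i occurs in mu."""
    m = [0] * d
    for value in mu:
        m[value - 1] += 1
    return tuple(m)

def multiplicity(m):
    """Multinomial coefficient n! / (m_1! ... m_d!) with n = sum(m)."""
    result = factorial(sum(m))
    for m_i in m:
        result //= factorial(m_i)
    return result
```

For example, with d = 3 the index vectors (1, 1, 3), (1, 3, 1) and (3, 1, 1) all map to the monomial (2, 0, 1), whose multiplicity is 3.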
The tensor multiplications in question, between the tensors g_{μ1…μk ν1…νj} and F_{ν1…νj}, can be
simplified by running over only the free parameters, indexed with the monomials:

    \sum_{\nu_1 \cdots \nu_j} g_{\mu_1 \cdots \mu_k \nu_1 \cdots \nu_j} \, F_{\nu_1 \cdots \nu_j} = \sum_{m} \binom{j}{m_1, \dots, m_d} \, g_{\mu_1 \cdots \mu_k m} \, F_{m} .
The multiplicity factor can be factored into F_m, just as the 1/j! terms before, with the
benefit that the same terms should be used when F_m is contracted with the tensor of an
input vector x_{ν1} ⋯ x_{νj}, in order to evaluate F(x). The remaining indices of g_{μ1…μk m} still
hold a k-fold symmetry, just as h_{μ1…μk}, which can be simplified with the same procedure. In
eq. (3), there is no summation over the free indices of h^k, hence there is no need for the
multiplicity factors either. This makes it possible to create a large vector of the serialised
h^k tensors, and a symmetric block matrix G_{jk}, containing the rectangular matrix version
of g^{k+j} for every j and k.
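The equivalence between the full contraction and the sum over free parameters can be checked on a small random tensor. This is a sketch with hypothetical helper names and zero-based indices, evaluating a single term F_{ν1…νj} x_{ν1} ⋯ x_{νj} of F(x).

```python
import numpy as np
from itertools import combinations_with_replacement, permutations
from math import factorial

def symmetrise(t):
    """Average a tensor over all permutations of its indices."""
    perms = list(permutations(range(t.ndim)))
    return sum(np.transpose(t, p) for p in perms) / len(perms)

def multiplicity(index, d):
    """Number of distinct permutations of a sorted, zero-based index tuple."""
    result = factorial(len(index))
    for count in np.bincount(index, minlength=d):
        result //= factorial(int(count))
    return result

def contract_full(F, x):
    """Contract every index of F with x, running over all d**j elements."""
    out = F
    for _ in range(F.ndim):
        out = np.tensordot(out, x, axes=([0], [0]))
    return float(out)

def contract_compact(F, x):
    """Same contraction, running only over the sorted (free) indices."""
    d, j = x.shape[0], F.ndim
    total = 0.0
    for idx in combinations_with_replacement(range(d), j):
        total += multiplicity(idx, d) * F[idx] * np.prod(x[list(idx)])
    return total
```

In a compact storage scheme the product of the multiplicity and F[idx] would be precomputed once, so evaluating F(x) only needs the monomials of x.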

The difficulty in creating this block matrix is that, although most of the g tensors are used
multiple times, they have to be partitioned into matrices in different ways. The efficient way
of storing g in memory was described above, but to create the matrix versions one needs to
access the tensor elements according to the simplified, monomial indices m_μ and m_ν of order
k and j for the tensor g^{k+j}, which matches the structure of the indices of h^k and F^j. In
case the elements in any of the tensors above are stored serially in lexical index order, then
for any index μ = {μ1, …, μk} it is true that μ1 ≤ μ2 ≤ … ≤ μk. For the ν indices on
the diagonal, where ν1 = ν2 = … = νk, it is possible to calculate the number of elements
with higher lexically ordered index, because those indices map the free elements of a d - ν1
dimensional symmetric tensor of order k. In the same way, the position of the generic μ index
can be found by first finding the p1 position for the diagonal index {μ1 + 1, …, μ1 + 1},
then calculating the p2 position for the η index, where η1 = μ1, but ηj = μ2 + 1 for every
j > 1. It is done by simply subtracting from the p1 position the number of free elements in
a d - μ2 dimensional symmetric tensor of order k - 1. Repeating this until the last element
of the μ index, the formula to calculate its position is
    pos(\mu) = \underbrace{\binom{d+k-1}{k}}_{\text{No. free elements of a sym. tensor of order } k} - \sum_{i=1}^{k} \binom{d - \mu_i + k - i}{k - i + 1} .

To match the elements of the serialised g^{k+j} tensor with the partitioned g_{m_μ m_ν} matrix,
one only has to combine the lexically ordered indices of μ and ν into one combined index with
k + j elements, which can be lexically ordered again with the help of its monomial.

1. In this paper it is assumed that the indexing starts from μ = {1, …, 1}, and the first position is
pos(μ) = 1.
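The position of a sorted index in the serialised storage can be sketched as follows. The closed form is worked out here from the step-by-step description in the text (start from the total number of free elements, then subtract the free elements of a d - μi dimensional tensor of order k - i + 1 for each index element), so it should be read as an assumption rather than as the paper's exact formula. The function names are hypothetical; indices and positions start from 1.

```python
from itertools import combinations_with_replacement
from math import comb

def n_free(d, k):
    """Free elements of a d-dimensional symmetric tensor of order k (0 if d == 0, k >= 1)."""
    return comb(d + k - 1, k)

def position(mu, d):
    """1-based position of the sorted index mu in lexical order."""
    k = len(mu)
    pos = n_free(d, k)
    for i, mu_i in enumerate(mu, start=1):
        pos -= n_free(d - mu_i, k - i + 1)
    return pos

def combined_position(mu, nu, d):
    """Position in the serialised g tensor of order k + j for a row index mu and a column index nu."""
    return position(tuple(sorted(mu + nu)), d)

# Cross-check against brute-force enumeration in lexical order (d = 3, k = 3).
enumerated = list(combinations_with_replacement(range(1, 4), 3))
assert [position(mu, 3) for mu in enumerated] == list(range(1, len(enumerated) + 1))
```

The brute-force enumeration is only there as a check; the point of the closed form is that no enumeration is needed when the block matrix is filled.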
3. Tests and conclusions

The following example was made on a three dimensional sample, consisting of twelve
non-overlapping, non-symmetric Gaussian peaks as signal and a flat background. The classifier is
a 20 degree multi-polynomial, which was found by solving a matrix equation with 1771 × 1771
elements in a few seconds with the solver of the Lapack library (see Anderson et al., 1999).
Calculating the elements of this matrix from 2 × 40 thousand events takes about 20 seconds
on a single core 2 GHz computer.
[Figure 2, panels: (a) multi-dimensional classification with polynomial; (b) monotonic prediction of the target]

Figure 2: [2a] Separation of signal (solid blue) and background (meshed red) events with the 20
degree multi-polynomial F(x). [2b] Slightly non-linear, but still monotonic prediction of signal purity
with F(x).

Figure 2a shows the histograms of the response, and although the F(x) values slightly
overshoot ±1, the response vs. purity on fig. 2b is still a strictly monotonically ascending
curve, assuring that F(x) approximates well the optimal classification contours. For
higher dimensional inputs, it is usually enough to approximate the classifier function with
a low degree polynomial to have a good estimate of the classification contours, or of the
separating power of a new variable. Nevertheless, the method seems to be stable against
overfitting: since it is fed with well determined moments of the distributions, and not
with the individual events themselves, it is not expected to be sensitive to the high frequency
noise associated with sampling. The method is also capable of fitting a non-binary target.
In this case the h tensors are expectation values of the target y multiplied with the x moments
of the input parameters, h^k = ⟨y · x^k⟩, while g^k = ⟨x^k⟩.
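For a non-binary target the same linear system can be filled with h_k = ⟨y · x^k⟩ and g_k = ⟨x^k⟩. The one-dimensional sketch below uses a hypothetical function name and a direct solver, which is adequate only for low degrees.

```python
import numpy as np

def fit_regression(x, y, degree):
    """Polynomial coefficients of the regression of y on x from sample moments."""
    n = degree + 1
    g = np.array([np.mean(x ** k) for k in range(2 * degree + 1)])   # g_k = <x^k>
    h = np.array([np.mean(y * x ** k) for k in range(n)])            # h_k = <y x^k>
    G = np.array([[g[k + j] for j in range(n)] for k in range(n)])
    return np.linalg.solve(G, h)

# A target that is exactly linear in x is recovered by a degree 2 fit.
x = np.linspace(-1.0, 1.0, 201)
coeffs = fit_regression(x, 1.0 + 2.0 * x, degree=2)
```

The returned coefficients are ordered from the constant term upwards, with the 1/k! factors already absorbed.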

As a remark, it must be noted that, since certain distributions have diverging moments,
it is beneficial to transform the distributions into compact phase spaces prior to training,
in order to have evaluable results.
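One possible way to obtain a compact phase space is sketched below. The text does not specify which transformation was used, so the choice of a tanh mapping after robust centring and scaling, as well as the function name, are assumptions made purely for illustration.

```python
import numpy as np

def compactify(x):
    """Map a 1D sample into [-1, 1] so that all of its moments are finite."""
    centre = np.median(x)
    q25, q75 = np.percentile(x, [25, 75])
    scale = q75 - q25          # interquartile range, robust against heavy tails
    return np.tanh((x - centre) / scale)

# A Cauchy sample has diverging moments, its transformed version does not.
rng = np.random.default_rng(3)
z = compactify(rng.standard_cauchy(1000))
```

The same centre and scale would have to be stored and applied to any later test sample, so that training and evaluation use the same mapping.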